Bottom-k document retrieval
Authors
Abstract
We consider the problem of retrieving, from a collection of strings, the k documents in which a given pattern P appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions to this problem is trivial, but that the compressed-space solutions are not easy to extend. We design a new solution for this problem that matches the best-known result when using 2|CSA| + o(n) bits, where CSA is a Compressed Suffix Array. Our structure answers queries in the time needed by the CSA to find the suffix array interval of the pattern plus O(k lg k lg^ε n) accesses to suffix array cells, for any constant ε > 0.
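To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the compressed structure described in the abstract: it scans every document and uses no suffix array. The function names `count_occurrences` and `bottom_k`, and the `include_absent` flag, are hypothetical. Whether documents that do not contain P at all should be eligible is an assumption here; by default this sketch skips them.

```python
import heapq
from typing import List, Sequence, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def bottom_k(
    documents: Sequence[str],
    pattern: str,
    k: int,
    include_absent: bool = False,  # assumption: skip documents without P
) -> List[Tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs with the lowest frequency.

    Ties are broken by the smaller document identifier.
    """
    freqs = []
    for doc_id, text in enumerate(documents):
        f = count_occurrences(text, pattern)
        if f > 0 or include_absent:
            freqs.append((f, doc_id))
    # nsmallest orders by frequency first, then by doc_id.
    return [(doc_id, f) for f, doc_id in heapq.nsmallest(k, freqs)]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "aaaa", "xyz"]
    print(bottom_k(docs, "a", 2))
    print(bottom_k(docs, "a", 2, include_absent=True))
```

This baseline reads the whole collection on every query, which is the cost an index-based solution avoids.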
Similar resources
Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback
Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting is proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
On some document clustering algorithms for data mining
We consider the problem of clustering large document sets into disjoint groups or clusters. Our starting point is recent literature on effective clustering algorithms, specifically Principal Direction Divisive Partitioning (PDDP), proposed by Boley in [1], and Spherical k-Means ("S-kmeans" for short), proposed by Dhillon and Modha in [4]. In this paper we study and evaluate the performance of thes...
Case Studies in Ontology-Driven Document Enrichment
In this paper we present an approach to document enrichment, which consists of associating formal knowledge models to archives of documents, to provide intelligent knowledge retrieval and (possibly) additional knowledge services, beyond what is available using 'standard' information retrieval and search facilities. The approach is ontology-driven, in the sense that the construction of the knowl...
Journal title:
- J. Discrete Algorithms
Volume 32
Publication date: 2015